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Abstract. This work investigates the empirical performance of the sparse 
synthesis versus sparse analysis regularization for the ill-posed inverse 
problem of audio declipping. We develop a versatile non-convex heuris¬ 
tics which can be readily used with both data models. Based on this 
algorithm, we report that, in most cases, the two models perform almost 
similarly in terms of signal enhancement. However, the analysis version is 
shown to be amenable for real time audio processing, when certain anal¬ 
ysis operators are considered. Both versions outperform state-of-the-art 
methods in the field, especially for the severely saturated signals. 
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1 Introduction 

Clipping, or magnitude saturation, is a well-known problem in signal processing, 
from audio w to image processing [an] and digital communications m- 
The focus of this work is audio declipping, to restore clipped audio signals. 
Audio signals become saturated usually during acquisition, reproduction or A/D 
conversion. The perceptual manifestation of clipped audio depends on the level 
of clipping degradation and the audio content. In case of mild to moderate 
clipping, the listener may notice occasional “clicks and pops” during playback. 
When clipping becomes severe, the audio content is usually perceived as if it 
was contaminated with a high level of additive noise, which may be explained 
by the introduction of a large number of harmonics caused by the discontinuities 
in the degraded signal. In addition to audible artifacts, some recent studies have 
shown that clipping has a negative impact on Automatic Speech Recognition 
(ASR) performance [22III) . 

In the following text, a sampled audio signal is represented by the vector 
a; £ K” and its clipped version is denoted by y £ K". The latter can be easily 
deduced from x through the following nonlinear observation model, called hard 
clipping: 

y = (1) 

' |sign(a;i)T otherwise. 


This work was supported in part by the European Research Council, PLEASE 
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While idealized, this clipping model is a convenient approximation allowing to 
clearly distinguish the clipped parts of a signal by identifying the samples having 
the highest absolnte magnitude. Indices corresponding to “reliable” samples of 
y (not affected by clipping) are indexed by fir, while I?))" and index the 
clipped samples with positive and negative magnitude, respectively. 

Our goal is to estimate the original signal x from its clipped version y, i.e. 
to “declip” the signal y. Ideally, the estimated signal x should satisfy natural 
magnitude constraints in order to be consistent with the clipped observations. 
Thus, we seek an estimate x which fulfills the following criteria: 

Mr X = Mr y M^ X > M^ y Af“ x < M], y, (2) 

where the matrices Mr, M'^. and M'^ are restriction operators. These are simply 
row-reduced identity matrices used to extract the vector elements indexed by the 
sets fir, ^ respectively. We write the constraints ^ as x G r{y). 

Obviously, consistency alone is not sufficient to ensure uniqueness of x, thus 
one needs to further regularize the inverse problem. The declipping inverse prob¬ 
lem is amenable to several regularization approaches proposed in the literature, 
such as based on linear prediction |12] , minimization of the energy of high order 
derivatives [H], psychoacoustics [B], sparsity |1l1Kl2llbl2:i| and cosparsity m 
(where we introduced a simplified version of the analysis-based algorithm pre¬ 
sented in this paper). The last two priors, briefly explained in the next section, 
enable some state-of-the-art methods in clipping restoration. 

In this paper we empirically compare the performance of the two priors, 
by means of a declipping algorithm which is easily adaptable to both cases. 
Our findings are that the sparsity-based version of the algorithm marginally 
outperforms the cosparsity-based one, but this fact may be attributed to the 
choice of the stopping criterion. On the other hand, for a class of analysis 
operators, the cosparsity-based algorithm has very low complexity per iteration, 
which makes it suitable for real-time audio processing. 


2 The sparse synthesis and sparse analysis data models 

It is well-known that the energy of audio signals is often concentrated either in 
a small number of frequency components, or in short temporal bursts 
they are (approximately) time-frequency sparse. The traditional sparse synthesis 
viewpoint [819) on this property is that audio signals are well approximated 
by linearly combining few columns of a dictionary matrix D e d > n 

such as a Gabor dictionary, i.e. x k, Dz, where 2 : G is sparse. A less 
explored alternative is the cosparse analysis perspective m asserting that Ax 
is approximately sparse, with A G p>n and analysis operator. The 

two data models are different [mi, unless p = n and A — D Finding the 
sparsest (in the sense of synthesis or analysis) vector x satisfying constraints 
such as (H is in general intractable, but convex or greedy heuristics provide 
efficient algorithms with certain performance guarantees pnTl . 
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3 Algorithms 


Some empirical evidence ra suggests that standard £i convex relaxation does 
not perform well for sparse synthesis regularization of the declipping inverse 
problem. Therefore, we developed an algorithmic framework based on non- 
convex heuristics, that can be straightforwardly parametrized for use in both 
the synthesis and the analysis setting. To allow for possible real-time imple¬ 
mentation, the algorithms operate on individual blocks (chunks) of audio data, 
which is subsequently resynthesized by means of the overlap-add scheme. 

The heuristics should approximate the solution of the following synthesis- 
and analysis-regularized inverse problem^: 

minimize ||2:||o -I- lr(y){x) + - Dz) (3) 

X,Z ' ^ ~ 

minimize ||2:||o -f lr(y){x) + lt^<e{Ax - z). (4) 

X.Z ~ 


The indicator function lr( 3 /) of the constraint set r{y) forces the estimate x 
to satisfy The additional penalty 1^2<e is a coupling functional. Its role 
is to enable the end-user to explicitly bound the distance between the estimate 
and its sparse approximation. These are difficult optimization problems: besides 
inherited NP-hardness, the two problems are also non-convex and non-smooth. 

We can represent and in an equivalent form, using the indicator 
function on the cardinality of z and an integer-valued unknown k: 


minimize lig<k{z) + lr<y){x) + Fc{x,z) 

x.z^k 


( 5 ) 


where Fc{x, z) is the appropriate coupling functional. For a fixed k, problem ([5]) 
can be seen as a variant of the regressor selection problem, which is (locally) 
solvable by the Alternating Direction Method of Multipliers (ADMM) [5I3| : 


Synthesis version 
2 :^’“''^^ = argmin \\z — 

Z 

subject to Dz G r{y) 
«(i+l) =«(i) + 




Analysis version 
='Hk{Ax^^^ -I- M^‘^) 

= argmin ||Aa; — -|- 

X 




subject to a: G r[y) 


( 6 ) 


The operator Tikiv) performs hard thresholding, i.e. sets all but k highest in 
magnitude components of v to zero. Unlike the standard regressor selection 
algorithm, for which the ADMM multiplier needs to be appropriately chosen 
to avoid divergence, the above formulation is independent of its value. 

In practice, it is difficult to guess the optimal value of k beforehand. An 
adaptive estimation strategy is to periodically increase k (starting from some 

^ Observe that if D and A are unitary matrices, the two problems become identical. 
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small value), perform several runs of ([HI) for a given k and repeat the procedure 
until the constraint embodied by is satisfied. This corresponds to sparsity 
relaxation: as k gets larger, the estimated 2 becomes less sparse. 

The proposed algorithm, dubbed SParse Audio DEclipper (SPADE), comes 
in two flavors. The pseudocodes for the synthesis version {“S-SPADE") and for 
the analysis version {“A-SPADE”) are given in Algorithm [T] and Algorithmic] 


Algorithm 1 S-SPADE Algorithm 2 A-SPADE 


Require: D,y,Mr,Mt 

Require: A,y, Mr, Mf, Ml,s,i,e 

1 

= D^y, = 0, i = 1, k = s 

1 

a;l°) = y^ n^°) = 0, i = 1, k = s 

2 


2 

zd) = Rk ("Aa;d-i) _(_ ud-i)'! 

3 

V / 

id) = arg min^llz - z^) -|- 

3 

a;d) = arg min,^, || Aa; — zd) -|- ud“^) ||| 


s.t. X = Dz £ r 


s.t. X £ r 

4 

if pd) -id )||2 < £ then 

4 

if ||Aa;d) _ zd )||2 < e then 

5 

terminate 

5 

terminate 

6 

else 

6 

else 

7 


7 

ud) = ud“^) -|~ Aa;d) _ ^(O 

8 

i ■(— i -1- 1 

8 

i ■<— i + 1 

9 

if i mod r = 0 then 

9 

if i mod r = 0 then 

10 

k •<— k -1- s 

10 

k k -1- s 

11 

end if 

11 

end if 

12 

go to 2 

12 

go to 2 

13 

end if 

13 

end if 

14 

retnrn x = 

14 

retnrn x = a;d) 


The relaxation rate and the relaxation stepsize are controlled by the integer¬ 
valued parameters r > 0 and s > 0, while the parameter £ > 0 is the stopping 
threshold. 

Lemma 1 The SPADE algorithms terminate in no more than i = [dr/s -I- 1] 
iterations. 

Proof. Once k > d, the hard thresholding operation Hk becomes an identity 
mapping. Then, the minimizer of the constrained least squares step 3 is 
(respectively, and the distance measure in the step 4 is equal to [ 2 . 

But, in the subsequent iteration, = 0 and the algorithm terminates. 

This bound is quite pessimistic: in practice, we observed that the algorithm 
terminates much sooner, which suggest that there might be a sharper upper 
bound on the iteration count. 


4 Computational aspects 

The general form of the SPADE algorithms does not impose restrictions on the 
choice of the dictionary nor the analysis operator. From a practical perspective, 
however, it is important that the complexity per iteration is kept low. The 
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dominant cost of SPADE is in the evaluation of the linearly constrained least 
squares minimizer step, whose computational complexity can be generally high. 
Fortunately, for some choices of D and A this cost is dramatically reduced. 




The projection Ps{-) is straightforward and corresponds to component-wise 
mappings, thus the per iteration cost of the algorithm is reduced to the cost 
of evaluating matrix-vector products. 

On the other hand, for S-SPADE this simplification is not possible and the 
constrained minimization in step 3 needs to be computed iteratively. However, 
by exploiting the tight frame property oi D = A^ and the Woodbury matrix 
identity, one can build an efficient algorithm that solves this optimization prob¬ 
lem with low complexity. 

Finally, the computational cost can be further reduced if the matrix-vector 
products with D and A can be computed with less than quadratic cost. Some 
transforms that support both tight frame property and fast product computation 
are also favorable in our audio (co)sparse context. Such well-known transforms 
are Discrete Fourier Transform, (Modified) Discrete Cosine Transform, (Modi¬ 
fied) Discrete Sine Transform and Discrete Wavelet Transform, for instance. 

5 Experiments 

The experiments are aimed to highlight differences in signal enhancement perfor¬ 
mance between S-SPADE and A-SPADE, and implicitly, the sparse and cosparse 
data models. It is noteworthy that in the formally equivalent setting (A = D~^), 
the two algorithms become identical. As a sanity-check, we include this setting 
in the experiments. The relaxation parameters are set to r = 1 and s = 1, and 
the stopping threshold is e = 0.1. 

In addition to SPADE algorithms, we also include Consistent IHT [T^ and 
social sparsity declipping algorithm m as representatives of state-of-the-art. 
The latter two algorithms use the sparse synthesis data model for regularizing 
the declipping inverse problem. Consistent IHT is a low-complexity algorithm 
based on famous Iterative Hard Thresholding for Compressed Sensing [i] , while 
the social sparsity declipper is based on a structured sparsity prior |16) . 

As mentioned before, this work is not aimed towards investigating the appro¬ 
priateness of various time-frequency transforms in the context of audio recovery, 
which is why we choose traditional Short Time Fourier Transform (STFT) for 
all experiments. We use sliding square-rooted Hamming window of size 1024 


^ Recall that the matrices Mr, M^ and Af“ are tight frames by design. 
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samples with 75% overlap. The redundancy level of the involved frames (corre¬ 
sponding to per-chunk inverse DFT for the dictionary and forward DFT for the 
analysis operator) is 1 (no redundancy), 2 and 4. The social sparsity declipper, 
based on Gabor dictionary, requires batch processing of the whole signal. We 
adjusted the temporal shift, the window and the number of frequency bins in 
accordance with previously mentioned STFT settings d. 

For a measure of performance, we use a simple difference between 
signal-to-distortion ratios of clipped (SDR^) and processed (SDRa,) signals: 



Hence, only the samples corresponding to clipped indices are taken into account. 
Concerning SPADE, this choice makes no difference, since the remainder of the 
estimate x perfectly fits the observations y. However, it may favor the other 
two algorithms that do not share this feature. 

Audio examples consist of 10 music excerpts taken from RWC database m. 
which significantly differ in tonal and vocal content. The excerpts are of approx¬ 
imately similar duration (^ 10s), and are sampled at 16kHz with 16bit encoding. 
The inputs are generated by artificially clipping the audio excerpts at five levels, 
ranging from severe (SDR^ = IdB) towards mild (SDR^ = lOdB). 

According to the results presented in figure (TJ the SPADE algorithms yield 
highest improvement in SDR among the four considered approaches. As as¬ 
sumed, 

S-SPADE and A-SPADE achieve similar results in a non-redundant setting, but 
when the overcomplete frames are considered, the synthesis version performs 
somewhat better. Interestingly, the overall best results for the analysis ver¬ 
sion are obtained for the twice-redundant frame, while the performance slightly 
drops for the redundancy four. This is probably due to the absolute choice of 
the parameter e, and suggests that in the analysis setting, this value should be 
replaced by a relative threshold instead. In the non-redundant case, declipping 
by A-SPADE and Consistent IHT took (on the average) 3min and 7min, re¬ 
spectively, while the other two algorithms were much slowei0 (on the order of 
hours). 

6 Conclusion 

We presented a novel algorithm for non-convex regularization of the declipping 
inverse problem. The algorithm is flexible in terms that it can easily accom¬ 
modate sparse (S-SPADE) or cosparse (A-SPADE) prior, and as such has been 
used to compare the recovery performance of the two data models. The empiri¬ 
cal results are slightly in favor of the sparse synthesis data model. However, the 

® We use the implementation kindly provided by the authors. 

All algorithms were implemented in Matlab®, and run in single-thread mode. 
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analysis version does not fall far behind, which makes it attractive for practical 
applications. Indeed, due to the natural way of imposing clipping consistency 
constraints, it can be implemented in an extremely efficient way, even allowing 
for a real-time signal processing. Benchmark on real audio data demonstrates 
that both versions outperform considered state-of-the-art algorithms in the field. 


Synthesis SPADE 
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Fig. 1. Declipping performance in terms of the SDR improvement. 


Future work will be dedicated to theoretical analysis of the algorithm, with 
emphasis on convergence. A possible extension is envisioned by introducing 
structured (co)sparsity priors in the presented algorithmic framework. 
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